We are interested in exploring the White Wine Quality data set, which can be downloaded from https://onlinecourses.science.psu.edu/stat857/node/223
In particular, we wish to perform exploratory data analysis to determine which chemical and physical features, if any, affect the quality of the wine as determined by three people.
The number of records and variables in the data are as follows:
## [1] 4898 13
There are 4898 observations.
The variables and their data types are:
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
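Output of this kind comes from `dim()` and `str()`. The following is a minimal sketch, assuming the download is a CSV file with a header row. So that it runs without the file, the first three records shown above (four columns only) are written to a temporary file; the path and the name `wine` are illustrative.

```r
# Stand-in for the downloaded file: the first three records shown above,
# restricted to four columns for brevity.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "X,fixed.acidity,alcohol,quality",
  "1,7.0,8.8,6",
  "2,6.3,9.5,6",
  "3,8.1,10.1,6"
), csv_path)

wine <- read.csv(csv_path)

dim(wine)  # number of rows, then number of columns
str(wine)  # column names, types and leading values
```

With the full file, the same two calls would report the record count and the type of each column.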
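There are 11 independent variables and 1 dependent variable (Wine Quality); the remaining column, X, is a row index. Input variables (based on physicochemical tests): fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates and alcohol.
Output variable (based on sensory data): quality.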
We will now consider each variable independently.
First we will examine how the quality of the wine varies.
As you can see from the box plot above, the quality of wine has a minimum value of 3 and a maximum value of 9.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.878 6.000 9.000
The median, mean and 3rd quartile are almost aligned. This is due to the integer nature of the data and the fact that a quality of 6 is the modal value as demonstrated in the bar graph below.
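The modal value of an integer variable can be read from a frequency table. This is a small sketch on made-up ratings; the `quality` vector is illustrative and is not the wine data.

```r
# Illustrative integer ratings; the real column has 4898 values.
quality <- c(5L, 6L, 6L, 7L, 6L, 5L, 6L, 8L, 6L, 7L)

counts <- table(quality)  # frequency of each rating
modal_quality <- as.integer(names(counts)[which.max(counts)])

summary(quality)  # min, quartiles, median, mean, max
modal_quality     # most frequent rating
```

When one rating dominates, the median and the nearby quartiles collapse onto it, which is the effect described above.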
We will first examine the free and total sulphur dioxides, and the sulphates. I have scaled the sulphates variable by 1000 so that all three have the same unit (mg/dm^3).
We can see from the top plot that the amount of free sulphur dioxide in the wine is between 0 and 100. If we remove the top and bottom 1% of readings, we recover a distribution that appears to be normal.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 33.00 34.63 45.00 80.00
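One way to remove the top and bottom 1% of readings is to filter on the 1st and 99th percentiles. The sketch below assumes that approach (the report does not show how the trimming was done) and uses simulated values in place of `free.sulfur.dioxide`.

```r
set.seed(1)
# Simulated stand-in for free.sulfur.dioxide (right-skewed, positive).
free_so2 <- rgamma(5000, shape = 4, scale = 9)

# Keep only values between the 1st and 99th percentiles.
limits <- quantile(free_so2, probs = c(0.01, 0.99))
trimmed <- free_so2[free_so2 >= limits[1] & free_so2 <= limits[2]]

summary(trimmed)
length(trimmed) / length(free_so2)  # roughly 0.98 of the readings remain
```

Changing `probs` to `c(0, 0.99)` would trim only the upper tail.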
From the Total Sulphur Dioxide plot, we see a distribution that is almost normal. If we remove the top 1% again, we get the following plot.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 137.1 166.0 241.0
Finally, if we look at the Sulphates plot, we can see some interesting peaks occurring when we have a bin width of 15. This suggests a multimodal distribution and could be due to the measuring equipment.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 220.0 410.0 470.0 489.8 550.0 1080.0
We will now examine the four input variables associated with the acidity of the wine: fixed.acidity, volatile.acidity, citric.acid and pH. The first three variables have the same unit (g/dm^3) and we can compare them directly to see which is the most common. We can also look at the overall acidity via the pH measurement.
Starting with Fixed Acidity, we see that this dominates the other two acids and is an order of magnitude larger.
It appears to have a normal distribution, which is consistent with the summary below: the mean and median are close together and the 1st and 3rd quartiles sit roughly symmetrically about the median.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
Now looking at the Volatile Acidity, we see that the distribution has a positive skew. Performing a transformation using logs gives the following plot.
Looking at the normal Q-Q plot for the transformed data, we can see that it approximates a normal distribution, as the points lie approximately on a straight line.
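The effect of a log transformation on a Q-Q plot can also be checked numerically. This sketch uses simulated, positively skewed values in place of `volatile.acidity`, and uses the correlation between theoretical and sample quantiles as a rough measure of straightness. That measure is my own choice, not something taken from the report.

```r
set.seed(2)
# Simulated positively skewed stand-in for volatile.acidity.
volatile <- rlnorm(2000, meanlog = log(0.26), sdlog = 0.4)

# qqnorm() with plot.it = FALSE returns the theoretical (x) and sample (y)
# quantiles without drawing anything.
qq_raw <- qqnorm(volatile, plot.it = FALSE)
qq_log <- qqnorm(log10(volatile), plot.it = FALSE)

# A correlation near 1 means the points lie close to a straight line.
cor(qq_raw$x, qq_raw$y)
cor(qq_log$x, qq_log$y)
```

Dropping `plot.it = FALSE` and following the call with `qqline()` draws the plot itself.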
Now looking at the Citric Acid, we can see what appears to be a normal distribution, but with an interesting peak at 0.5 and a smaller peak at 0.75.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
Finally looking at the pH, we can see what appears to be a normal distribution.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
Finally, we will examine the Sugar and Chloride content, density and alcohol percentage individually.
As with the sulphates, the residual sugar has a positive skew and re-scaling gives the following plot.
This plot shows that, once we apply the transformation, we have a bi-modal distribution.
Now looking at the chlorides, we can apply the same transformation.
Looking at the density, we can see what appears to be a normal distribution. Removing the top and bottom 0.1% gives the following plot.
Finally, looking at the alcohol percentage by volume, we have a very broad distribution. Most wines have an alcohol percentage between 9 and 13.
Having considered all of the variables on an individual basis, we now compare the variables with each other and look for correlations.
We start by using ggpairs. This shows scatter plots in the bottom left, density plots along the diagonal and correlations in the top right.
From the ggpairs plot, we can identify several interesting results.
These seem to make sense. Dissolved sugar would increase the density, but a higher alcohol content would decrease it because alcohol has a lower density than water. Also, as there is more free SO2, you would expect there to be more SO2 in total.
Looking at the strongest correlations, we start with the correlation between density and residual sugar.
You can see there is a positive correlation but the variance is larger at lower densities.
Now looking at Total Sulphur Dioxide against Free Sulphur Dioxide, we can see that there is also a positive correlation visible here. In this case the variability increases as the total increases.
Finally, considering Alcohol against Density, we can see a negative correlation. We can also see that the measure of Alcohol is limited by its recorded precision: for most wines it is given to only one decimal place.
Here we present an ordered list of correlations from most negative to most positive. The correlations are calculated using Pearson’s method.
## density chlorides volatile.acidity
## -0.307123313 -0.209934411 -0.194722969
## total.sulfur.dioxide fixed.acidity residual.sugar
## -0.174737218 -0.113662831 -0.097576829
## citric.acid free.sulfur.dioxide sulphates
## -0.009209091 0.008158067 0.053677877
## pH alcohol
## 0.099427246 0.435574715
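An ordered list like this can be produced by correlating each input column with quality and sorting the result. The sketch below runs on a toy data frame. The three input columns, their simulated relationships and the name `toy` are all assumptions made for illustration.

```r
set.seed(3)
n <- 500
# Toy stand-in data: quality rises with alcohol, density falls with alcohol.
alcohol <- runif(n, 8, 14)
density <- 1.002 - 0.001 * alcohol + rnorm(n, sd = 0.001)
pH      <- rnorm(n, mean = 3.2, sd = 0.15)
quality <- round(3 + 0.3 * alcohol + rnorm(n, sd = 0.8))

toy <- data.frame(alcohol, density, pH, quality)

inputs <- setdiff(names(toy), "quality")
# cor() uses Pearson's method by default; it is spelled out here for clarity.
correlations <- sapply(toy[inputs], cor, y = toy$quality, method = "pearson")
sort(correlations)  # most negative first, most positive last
```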
We can see that density and chlorides have a negative correlation with the quality, whereas the alcohol percentage by volume has a positive correlation. One thing to note from the previous plot is that alcohol and density are strongly correlated.
The strongest correlation is between alcohol and quality. The following box plot demonstrates the correlation, but it also suggests that this relationship is not linear, as we would expect. The result is also distorted by the small number of wines with a rating of 9 and the absence of wines with ratings below 3.
Now considering Density against Quality, we can see the slight negative correlation.
Now considering Chlorides against Quality, we can see the slight negative correlation. You can also see that there are a lot more outliers in the Chloride data.
In this section, we considered bi-variate relationships. We identified the three strongest correlations: density with residual sugar, total with free sulphur dioxide, and alcohol with density.
We examined scatter plots to explore these correlations and look at the variance.
We then looked at the correlations between the independent parameters and quality. While there are no really strong correlations, density and alcohol were larger than all the others. We will expand on these correlations in the next section.
Comparing alcohol and density coloured by quality gives the following scatter diagram.
As you can see, it is not clear whether there is a strong relationship. The picture becomes clearer, however, if we bin the quality into three ratings (Poor, Good and Excellent):
Applying this rating, we get the following counts.
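Binning an integer score into ordered ratings can be done with `cut()`. In the sketch below the cut points are hypothetical, since the actual thresholds are not shown here, and the `quality` vector is made up for illustration.

```r
# Illustrative ratings; the real column has 4898 values.
quality <- c(3L, 4L, 5L, 5L, 6L, 6L, 6L, 7L, 8L, 9L)

# Hypothetical cut points: (2,5] -> Poor, (5,6] -> Good, (6,9] -> Excellent.
rating <- cut(quality,
              breaks = c(2, 5, 6, 9),
              labels = c("Poor", "Good", "Excellent"),
              ordered_result = TRUE)

table(rating)  # number of wines in each rating
```

An ordered factor keeps Poor, Good and Excellent in that order in tables and plot legends.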
The largest correlation was between density and alcohol. The plot below shows this correlation while colouring the points by the new rating. This results in a clearer plot than just using the quality factor.
In the plot below we can see the negative correlation between Alcohol and Chlorides more clearly.
Now looking at pH, which has the second strongest positive correlation with quality, and colouring by quality gives the following:
Another interesting correlation is between density and residual sugar. You can see the strong positive correlation between the two, but also the negative correlation between density and rating.
Finally, looking at Alcohol and Residual Sugar, we see the trend of higher alcohol having higher quality, but there also appears to be a trend of low alcohol and low sugar indicating a poor quality wine.
The first plot I will consider in this final section is a box plot showing how Alcohol affects the rating.
This box plot shows that wines with a better rating tend to have a higher Alcohol content.
This is the correlation between quality and alcohol.
## [1] 0.4355747
The Poor Summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.20 9.60 9.85 10.40 13.60
The Good Summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 9.60 10.50 10.58 11.40 14.00
The Excellent Summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 10.70 11.50 11.42 12.40 14.20
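Per-rating summaries like these can be computed in one call with `tapply()`. This sketch uses a nine-row toy data frame, not the wine data, and assumes a `rating` factor has already been assigned.

```r
# Toy stand-in: alcohol (% by volume) with an already-assigned rating.
toy <- data.frame(
  alcohol = c(9.2, 9.6, 10.4, 9.6, 10.5, 11.4, 10.7, 11.5, 12.4),
  rating  = factor(rep(c("Poor", "Good", "Excellent"), each = 3),
                   levels = c("Poor", "Good", "Excellent"))
)

# One six-number summary of alcohol per rating level.
by_rating <- tapply(toy$alcohol, toy$rating, summary)
by_rating

# Medians only, in rating order.
alcohol_medians <- tapply(toy$alcohol, toy$rating, median)
alcohol_medians
```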
Next we consider the two components with the strongest correlation to quality: Density and Alcohol in a scatter plot. Here the trend lines show a clear separation at low Density and high Alcohol that reinforces the hypothesis that higher Alcohol content is regarded as better quality. We are also seeing the strong correlation between Alcohol and Density.
In the final plot, we return to just looking at how alcohol and quality rating are related. This density count plot clearly shows the Excellent wines, on average, have a higher alcohol content. This can be seen in the density plots and the means.
After performing the EDA on the White Wine Quality data set, I have come to the following conclusion: while several independent variables are correlated with Quality, and Alcohol appears to be the strongest contributor, the picture is more complicated than simply saying “A high alcohol content indicates a good quality wine”.
While the rating system goes from 0 to 10, there are no wines with ratings below 3 or with a rating of 10, and the integer rating results in binning which would not occur on a percentage system. The data set would benefit from more observations as there were only 4 wines that rated 9 and these are clustered. These could indicate a trend or these could all be outliers.
If quality were a continuous numerical value, it would be easier to develop linear models to predict its value based on the input variables.
It was enjoyable to look at a new data set, but it was also hard, as there were no clear indicators of quality comparable to the indicators for price in the diamonds data set.
Creating some of the more complicated plots, such as the last one, took a little longer than when I have used Matplotlib, but, as with Python, there is a large community to provide templates and code snippets.
When I was stuck, I went searching for examples from other people analysing this data set to inspire and help me. Two very good examples can be found below. While I acknowledge their ideas, the code presented here is my own.
Example project https://s3.amazonaws.com/content.udacity-data.com/courses/ud651/diamondsExample_2016-05.html
Wine Quality Info https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityInfo.txt
Inspired by https://github.com/keymanesh/Udacity--Data-Analysis-with-R
Final plot using template from here http://www.sthda.com/english/wiki/ggplot2-histogram-plot-quick-start-guide-r-software-and-data-visualization